Estimating Activity based on Mobility Data#

Less movement typically means less economic activity. Understanding where and when population movement occurs can help inform disaster response and public policy, especially during crises.

Similar to COVID-19 Community Mobility Reports, Facebook Population During Crisis, and Mapbox Movement Data, we generate a series of crisis-relevant metrics: the baseline and ongoing (sample) population density (n_baseline and count), the percent change, and the z-score. The metrics are calculated by tallying the device count in each tile for each time period, drawn from longitudinal mobility data, and comparing it to a baseline period.

Data#

In this section, we import the data sources, available either publicly or via the Foundational Datasets and Data Products Summary.

Area of Interest#

In this step, we import the clipping boundary and the H3 tessellation defined by the area(s) of interest below.

AOI = geopandas.read_file("../../data/final/tessellation/SYRTUR_tessellation.gpkg")

Mobility Data#

Through the Development Data Partnership, the project team obtained longitudinal mobility data, from which we derive the baseline and ongoing (sample) population density (n_baseline and count), the percent change, and the z-score. For additional information, please see Data and Methodology.

ddf = dd.read_parquet(
    f"../../data/final/panels/{PANEL}",
    columns=["hex_id", "longitude", "latitude", "datetime", "uid", "month"],
)

Note

Due to the data volume and velocity (the data is updated daily), the panel was computed on an AWS EC2 instance owned by the project team. The resulting aggregation is the tabulation of the device count for each hex_id and date.

First, we calculate the cardinality,

len(ddf)
185864506

And visualize,

Visualization of the mobility data panel composed of approximately 200 million points. Source: Veraset Movement.

Methodology#

The methodology consists of generating a series of crisis-relevant metrics, including the baseline (sample) population density, percent change, and z-score, based on the number of devices in an area at a given time. The device count is determined for each tile and each time period, as defined by the data standards and the spatial and temporal aggregations below. Similar approaches have been adopted elsewhere, such as in [Maas, 2019]. The metrics may reveal movement trends in the sampled population that indicate more or less activity.

Data Standards#

Population Sample#

The sampled population is composed of GPS-enabled devices drawn from longitudinal mobility data. It is important to emphasize that the sampled population is obtained via convenience sampling and that the mobility data panel represents only a subset of the total population in an area at a time, specifically only users who turned on location tracking on their mobile devices. Thus, the derived metrics do not represent the total population density.

Spatial Aggregation#

The metrics are spatially aggregated on H3 resolution 6 tiles, each covering an area of approximately \(36\ km^2\) on average.


H3 resolution 6 tile near Gaziantep, Türkiye. Gaziantep is among the areas most affected by the 2023 Türkiye–Syria Earthquake; the 2,200-year-old Gaziantep Castle was destroyed during the seismic episodes.

Temporal Aggregation#

The metrics are aggregated daily in the Europe/Istanbul (UTC+3) timezone.
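Because the daily bucketing is timezone-sensitive, a ping recorded near midnight UTC can fall on a different local calendar day. A minimal sketch of the date assignment (a hypothetical helper, not part of the pipeline):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_local_date(ts_utc: datetime) -> str:
    """Calendar date of a UTC timestamp in the Europe/Istanbul (UTC+3) timezone."""
    return ts_utc.astimezone(ZoneInfo("Europe/Istanbul")).date().isoformat()

# A ping recorded at 22:30 UTC already belongs to the next calendar day locally.
print(to_local_date(datetime(2023, 2, 5, 22, 30, tzinfo=timezone.utc)))  # 2023-02-06
```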

Implementation#

Calculate ACTIVITY#

In this step, we calculate ACTIVITY as a density; that is, the daily tally of distinct devices detected within each area of interest. Please note that we use a spatial join (whether a device was detected at least once within an area of interest), which is a simplistic approach compared to, for example, estimating stay locations.
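The device-deduplication logic can be sketched on a toy pandas frame (hypothetical pings, not the project's data): repeated pings from one device within a day count once.

```python
import pandas as pd

# Toy pings (hypothetical): device u1 pings tile "a" twice on the same day.
pings = pd.DataFrame({
    "hex_id": ["a", "a", "a", "b"],
    "uid": ["u1", "u1", "u2", "u1"],
    "datetime": pd.to_datetime([
        "2023-02-06 08:00", "2023-02-06 17:00",
        "2023-02-06 09:30", "2023-02-06 12:00",
    ]),
})

daily = (
    pings.assign(date=pings["datetime"].dt.normalize())
    .groupby(["hex_id", "date"])["uid"]
    .nunique()  # distinct devices, so repeated pings count once
)
# Tile "a" tallies 2 devices (not 3 pings); tile "b" tallies 1.
```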

ACTIVITY = (
    ddf.assign(date=lambda x: dd.to_datetime(x["datetime"].dt.date))
    .groupby(["hex_id", "date"])["uid"]
    .nunique()
    .to_frame("count")
    .reset_index()
    .compute()
)

Additionally, we create a weekday column that will come in handy later when standardizing.

ACTIVITY["weekday"] = ACTIVITY["date"].dt.weekday
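The weekday encoding follows Python's datetime convention, which pandas shares. A quick check with stdlib dates:

```python
from datetime import date

# pandas' dt.weekday matches datetime.weekday(): Monday == 0, ..., Sunday == 6.
print(date(2023, 1, 2).weekday())  # 0, a Monday: first day of the baseline window
print(date(2023, 1, 8).weekday())  # 6, a Sunday
```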

Calculate BASELINE#

In this step, we choose the 4-week period spanning January 2, 2023 to January 29, 2023 as the baseline. The baseline is calculated for each tile and for each time period, according to the spatial and temporal aggregations.

BASELINE = ACTIVITY[ACTIVITY["date"].between("2023-01-02", "2023-01-29")]

In fact, the result is 7 different baselines for each tile: we calculate the mean and standard deviation of the device density for each tile and each day of the week (Mon-Sun).

MEAN = BASELINE.groupby(["hex_id", "weekday"]).agg({"count": ["mean", "std"]})

Taking a sneak peek,

MEAN[MEAN.index.get_level_values("hex_id").isin(["862da898fffffff"])]
                         count.mean    count.std
hex_id          weekday
862da898fffffff 0           5819.75  2285.557901
                1           6675.25  1918.023527
                2           7020.00  2137.928281
                3           6586.00  2345.257484
                4           5671.50  2838.529490
                5           6300.00  2516.413718
                6           6891.75  2462.698029

Calculate Z-Score and Percent Change#

A z-score is a statistical measure that tells how far above or below the mean (average) of a group of data points a particular data point lies, in terms of standard deviations. The z-score is particularly useful to standardize and make meaningful comparisons between different sets of data: by examining z-scores, one can assess how far a value deviates from the mean while accounting for the variance. A percent change, on the other hand, may be easier to interpret but does not carry this information.
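As a worked example with illustrative numbers (a tile with baseline mean 7020 and standard deviation 2137.93, observing 4784 devices on a given day):

```python
mean, std = 7020.0, 2137.93   # illustrative baseline statistics for one tile
count = 4784                  # observed device count on a given day

z_score = (count - mean) / std
percent_change = 100 * (count / mean - 1)

print(round(z_score, 2))         # -1.05: about one standard deviation below baseline
print(round(percent_change, 2))  # -31.85: roughly a third fewer devices
```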

Creating StandardScaler for each hex_id,

scalers = {}

for hex_id in BASELINE["hex_id"].unique():
    scaler = StandardScaler()
    scaler.fit(BASELINE[BASELINE["hex_id"] == hex_id][["count"]])

    scalers[hex_id] = scaler
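Each fitted scaler stores only the tile's baseline mean and standard deviation; a stdlib sketch of the equivalent computation (illustrative counts, not the project's data):

```python
from statistics import fmean, pstdev

baseline = [5819, 6675, 7020, 6586, 5671, 6300, 6891]  # illustrative daily counts

mu = fmean(baseline)
sigma = pstdev(baseline)  # population std (ddof=0), as StandardScaler uses

def transform(count):
    """Replicates scaler.transform for a single observation."""
    return (count - mu) / sigma
```

Note that pandas' std in the MEAN table defaults to ddof=1, so count.std will differ slightly from the scaler's scale_.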

Joining with the area of interest (AOI),

ACTIVITY = ACTIVITY.merge(AOI, how="left", on="hex_id").drop(["geometry"], axis=1)

Finally, merging with the (mean) baseline,

ACTIVITY = pd.merge(ACTIVITY, MEAN, on=["hex_id", "weekday"], how="left")

Calculating the z_score for each tile,

for hex_id, scaler in scalers.items():
    predicate = ACTIVITY["hex_id"] == hex_id
    if not predicate.any():
        continue  # tile absent from the ongoing panel
    score = scaler.transform(ACTIVITY.loc[predicate, ["count"]])
    ACTIVITY.loc[predicate, "z_score"] = score.ravel()
# ACTIVITY.to_csv("../../data/SYRTUR_mobility_activity.csv", index=False)
# ACTIVITY = pd.read_csv("../../data/SYRTUR_mobility_activity.csv")

Additionally, we calculate the percent change. While the z-score offers more numerical stability, the percent change can be used when interpretability matters most. Thus, preparing the columns,

ACTIVITY["n_baseline"] = ACTIVITY["count.mean"]
ACTIVITY["n_difference"] = ACTIVITY["count"] - ACTIVITY["n_baseline"]
ACTIVITY["percent_change"] = 100 * (ACTIVITY["count"] / (ACTIVITY["n_baseline"]) - 1)
ACTIVITY.sort_values("date")
hex_id date count weekday ADM0_PCODE ADM1_PCODE ADM2_PCODE ADM1_EN distance distance_bin count.mean count.std n_baseline n_difference percent_change z_score
0 862c30097ffffff 2023-01-01 1 6 TR TUR021 TUR021015 DIYARBAKIR 374.393513 6 1.00 0.000000 1.00 0.00 0.000000 -0.267261
11872 862c3434fffffff 2023-01-01 1 6 TR TUR023 TUR023006 ELAZIG 335.215527 6 NaN NaN NaN NaN NaN 0.000000
11873 862c34357ffffff 2023-01-01 1 6 TR TUR023 TUR023008 ELAZIG 325.120193 5 NaN NaN NaN NaN NaN 0.000000
11876 862c34377ffffff 2023-01-01 1 6 TR TUR023 TUR023006 ELAZIG 322.406932 5 NaN NaN NaN NaN NaN NaN
11877 862c343a7ffffff 2023-01-01 4 6 TR TUR023 TUR023010 ELAZIG 305.144523 5 4.25 2.061553 4.25 -0.25 -5.882353 -0.304287
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
397590 862da1787ffffff 2023-09-18 2 0 TR TUR031 TUR031015 HATAY 170.260140 3 2.00 0.000000 2.00 0.00 0.000000 0.398862
396586 862c32cc7ffffff 2023-09-18 87 0 SY SY08 SY0802 Al-Hasakeh 374.831252 6 NaN NaN NaN NaN NaN NaN
399290 862db175fffffff 2023-09-18 4 0 SY SY12 SY1202 Dar'a 458.336675 8 NaN NaN NaN NaN NaN NaN
396591 862c32cdfffffff 2023-09-18 1 0 SY SY08 SY0802 Al-Hasakeh 380.716677 6 NaN NaN NaN NaN NaN NaN
404046 862da8827ffffff 2023-09-18 1 0 TR TUR027 TUR027008 GAZIANTEP 32.278982 0 10.50 2.081666 10.50 -9.50 -90.476190 -2.849294

408676 rows × 16 columns

ACTIVITY = ACTIVITY[(ACTIVITY["count"] >= 10) | (ACTIVITY["n_baseline"] >= 10)]
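The threshold keeps a tile-day when either the ongoing or the baseline density reaches 10 devices, suppressing very sparse tiles (toy rows, not project data):

```python
rows = [
    {"hex_id": "a", "count": 4, "n_baseline": 3.0},    # sparse: dropped
    {"hex_id": "b", "count": 12, "n_baseline": 8.0},   # kept: ongoing count >= 10
    {"hex_id": "c", "count": 2, "n_baseline": 15.0},   # kept: baseline >= 10
]

kept = [r["hex_id"] for r in rows if r["count"] >= 10 or r["n_baseline"] >= 10]
print(kept)  # ['b', 'c']
```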

Taking a sneak peek,

hex_id date count n_baseline n_difference percent_change z_score ADM0_PCODE ADM1_PCODE ADM2_PCODE
245836 862da898fffffff 2023-06-14 4784 7020.0 -2236.0 -31.851852 -0.776792 TR TUR027 TUR027008
318112 862da898fffffff 2023-07-12 866 7020.0 -6154.0 -87.663818 -2.633175 TR TUR027 TUR027008
34538 862da898fffffff 2023-02-08 4689 7020.0 -2331.0 -33.205128 -0.821804 TR TUR027 TUR027008
354973 862da898fffffff 2023-08-02 86 7020.0 -6934.0 -98.774929 -3.002745 TR TUR027 TUR027008
81714 862da898fffffff 2023-03-08 8167 7020.0 1147.0 16.339031 0.826102 TR TUR027 TUR027008
... ... ... ... ... ... ... ... ... ... ...
402043 862d12ccfffffff 2023-09-12 12 NaN NaN NaN NaN TR TUR001 TUR001006
402044 862d12ccfffffff 2023-09-13 14 NaN NaN NaN NaN TR TUR001 TUR001006
402045 862d12ccfffffff 2023-09-14 15 NaN NaN NaN NaN TR TUR001 TUR001006
402046 862d12ccfffffff 2023-09-15 14 NaN NaN NaN NaN TR TUR001 TUR001006
402047 862d12ccfffffff 2023-09-16 14 NaN NaN NaN NaN TR TUR001 TUR001006

92512 rows × 10 columns

Findings#

The following map shows the z-score for each tile and each time period. The z-score gives the number of standard deviations by which the data point diverges from the mean; in other words, whether the change in (sample) population for that area is statistically different from the baseline period.

Limitations#

The methodology presented is an investigative pilot aiming to shed light on the economic situation in Syria and Türkiye leveraging alternative data, especially when we are confronted with the absence of traditional data and methods.

Caution

Pending peer review, the limitations can be summarized as follows.

  • The methodology relies on private intent data. In other words, the input data, i.e. the mobility data, was not produced or collected to analyze the population of interest or address the research question as its primary objective but it was repurposed for the public good. The benefits and caveats when using private intent data have been discussed extensively in the World Development Report 2021 [World Bank, 2021].

  • On the one hand, the mobility data panel is spatially and temporally comprehensive and readily available; on the other hand, it is generated through convenience sampling, which constitutes an important source of bias. The panel composition is not entirely known and is susceptible to change; in other words, the collection and composition of the mobility data panel cannot be controlled.

  • In summary, the results cannot be generalized to the entirety of population movement, but they can potentially provide information on movement patterns to inform analysis of the Syrian economic situation, considering time constraints and the scarcity of traditional data sources in the context of Syria.

References#

Maa19

Paige Maas. Facebook disaster maps: aggregate insights for crisis response & recovery. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, 3173. New York, NY, USA, 2019. Association for Computing Machinery. doi:10.1145/3292500.3340412.

WorldBank21

World Bank. World Development Report 2021 : Data for Better Lives. World Bank, 2021. License: CC BY 3.0 IGO. URL: http://hdl.handle.net/10986/35218.